Selecting keywords representative of a document

ABSTRACT

The method makes use of a given ontology to select keywords representative of a given document. The method finds all the terms in an ontology that occur in a document, and computes their frequency of occurrences in the document. The method then propagates these values from the leaves upwards to the root of the ontology during which it weights them. The method then selects a subset of terms of the ontology structure as keywords representative of the document based on these weights.

FIELD OF THE INVENTION

The present invention relates to a method of selecting keywords representative of a document from an ontology. The invention also relates to a computer program product comprising code means for implementing the steps of the method, and a computer system for performing the steps of the method.

BACKGROUND

Traditionally, a major tool in searching collections of documents has been the use of indexing. Indexing is the practice of establishing correspondences between a set of keywords or index terms and individual documents or sections thereof. Keywords are meant to indicate the topic or the content of the text, where the set of terms or keywords is chosen to reflect the topical structure of the collection, such as it can be determined. Typically, indexing is done manually by persons who read documents and assign keywords to them. Manual indexing is often both difficult and dull; it poses great demands on consistency from indexing session to indexing session and between different indexers. It is the sort of job that is a prime candidate for automation. Automating human performance is never trivial, however, even when the task at hand may seem repetitive and non-creative at first glance. Manual indexing is a quite complex task, and difficult to emulate by computers.

Relatively recently, automatic indexing methods have been proposed. Some of these methods are based on learning, training, or collocation (a window of text). Others use both documents and ontological structure(s) as information sources in order to select the keywords. However, all these methods suffer from the drawback that they do not consistently select keywords that are most representative of the documents.

SUMMARY

The methods of the invention make use of a given ontology to select keywords representative of a given document. The methods find all the terms in an ontology that occur in a document, and compute their frequency of occurrence in the document. The methods then select a subset of terms of the ontology structure as keywords for the document based on these frequency of occurrence values. In this fashion, given a document D and a domain ontology O (taxonomy), the method assigns (selects) k representative keywords from the ontology to the document.

The method in accordance with a first arrangement computes the frequency of occurrence of all the terms of the ontology that occur in the document and assigns these frequency of occurrence values to corresponding vertices in the ontology structure. The first arrangement then propagates these frequency of occurrence values from the leaves upwards to the root of the ontology structure, during which it weights them with a propagation factor. The first arrangement then outputs the words of the ontology structure having the k largest values as the keywords representative of the document.

The method in accordance with a second arrangement computes the frequency of occurrence of all the terms of the ontology that occur in the document and assigns these frequency of occurrence values to corresponding vertices in the ontology structure. The second arrangement then propagates these frequency of occurrence values from the leaves upwards to the root of the ontology structure, during which it weights them with a propagation factor. The second arrangement then selects a sub-structure of the ontology structure, which sub-structure comprises a set of unique paths from the root to the terms having non-zero weights. This selection step disambiguates the context of these terms. The second arrangement then performs an optimization sub-process, where k vertices are selected such that a sum of weighted distances of all the vertices having non-zero weights to associated selected k vertices is minimized. The k terms associated with these selected k vertices are selected as keywords representative of the document.

The method in accordance with a third arrangement computes the frequency of occurrence of all the terms of the ontology that occur in the document and assigns these frequency of occurrence values to corresponding vertices in the ontology structure. The third arrangement then performs an optimization sub-process, where k vertices are selected such that a sum of weighted distances of all the vertices having non-zero weights to associated selected k vertices is minimized. The k terms associated with these selected k vertices are selected as keywords representative of the document.

The methods in accordance with the first, second and third arrangements make use of a domain ontology, and generate ontology-dependent keywords. These approaches provide for the selection of keywords from the ontology structure that are representative of the document but are not necessarily present in the document itself. Such ontologies are typically created and agreed upon by experts and are therefore “standardized”. Furthermore, the methods in accordance with the arrangements can be pipelined with other domain-dependent analyses that use the same ontology. Since the methods in accordance with the arrangements do not rely on NLP-based techniques, they do not suffer from the limitations of such approaches. In addition, the present methods explicitly exploit the structure of an ontology in order to consistently select the keywords.

Another advantage of these approaches is that one can plug in different ontologies. In addition, the methods in accordance with the arrangements support various ontology structures, such as Directed Acyclic Graphs (DAGs), Collections of Trees (CT) and Collections of DAGs (CD).

The steps of the methods in accordance with the arrangements are preferably implemented as software code for execution on a computer system.

DESCRIPTION OF DRAWINGS

A number of preferred embodiments of the present invention will now be described with reference to the drawings, in which:

FIG. 1 illustrates a flow chart of a method of selecting keywords representative of a document using an ontology in accordance with a first arrangement.

FIG. 2 illustrates a flow chart of a method of selecting keywords representative of a document using an ontology in accordance with a second arrangement.

FIG. 3 illustrates a flow chart of a method of selecting keywords representative of a document using an ontology in accordance with a third arrangement.

FIG. 4 illustrates a flow chart of the sub-process ‘propagate_wt(vertex v)’ of step 130 of the method 100 of FIG. 1, and step 240 of the method 200 of FIG. 2.

FIG. 5 illustrates a flow chart of the sub-process ‘select_context(vertex v, vertex t)’ used in step 250 of the method of FIG. 2.

FIG. 6 illustrates a flow chart of the sub-process ‘locate_fac(T, C, integer k)’ used in step 260 of the method of FIG. 2, and step 330 of FIG. 3.

FIG. 7 is a schematic representation of a computer system suitable for performing the techniques described herein.

DETAILED DESCRIPTION

A brief review of the terminology and notation used herein is first undertaken. There is then provided a detailed description of the methods of selecting keywords representative of a document using an ontology in accordance with first, second and third arrangements, a detailed description of computer software for implementing the steps of the methods, and a detailed description of computer hardware that is suitable for executing such computer software.

Terminology

Ontology

In this document, the terms “ontology” and “taxonomy” are used synonymously. An ontology can have many possible structures, the most common among which are directed acyclic graphs (DAGs) and a collection of trees (CT). The methods described in this document work with both of them and with a third structure, a collection of DAGs (CD). A common feature of these ontology structures is that they each comprise one or more root vertices, a plurality of descendant vertices, and a plurality of descendant leaves, where the descendant vertices and leaves correspond to respective terms, that is words, in the ontology. An ontology that has a DAG structure may have a vertex that has multiple parents, which is a source of ambiguity. An ontology that has a CT structure comprises a number of vertices, where each vertex has only one parent. A vertex may appear in multiple trees. In this CT structure, transitivity does not hold across trees. An ontology that has a CD structure comprises multiple DAGs. In this CD structure a vertex may have multiple parents and may appear in multiple DAGs. Also, transitivity does not hold across the DAGs.

Ambiguity

A term is ambiguous when there are several paths in the ontology leading to it. Ambiguity arises in a DAG ontology structure when there are several paths to a single vertex. Ambiguity arises in CT/CD ontology structures where there are multiple vertices denoting the same term.

Context

A context is defined as a unique path in the ontology from the root to the term.

Notation

P_(t) denotes the set of all paths from the root to a term t in the entire ontology.
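As an illustration of the notation P_(t) and of the ambiguity defined above, the following minimal Python sketch enumerates every root-to-term path in a small DAG. The dictionary `children`, the function name `all_paths` and the vertex names are hypothetical and are introduced only for this example.

```python
def all_paths(v, t, children):
    """Return every path from vertex v down to term t as a list of vertex lists."""
    if v == t:
        return [[t]]
    paths = []
    for c in children.get(v, []):
        # Prefix the current vertex to every path found below child c.
        for tail in all_paths(c, t, children):
            paths.append([v] + tail)
    return paths


# Hypothetical DAG: term "t" has two parents, so it is ambiguous.
children = {"root": ["x", "y"], "x": ["t"], "y": ["t"]}
paths = all_paths("root", "t", children)
print(paths)  # [['root', 'x', 't'], ['root', 'y', 't']]
```

Here P_(t) contains two paths, so two candidate contexts exist for the term and one of them has to be chosen.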

w_(t) denotes the frequency of occurrence of term t in the document.

f is a propagation factor in [0,1] and is independent of the weight w_(v). Namely, the propagation factor f can take a value between 0 and 1 inclusive. The propagation factor f determines what fraction of the weight w_(v) contributes to the parent in the tree. Preferably, f is a constant; however, in alternative embodiment(s), f can be tunable, namely a function of the level in the tree, the number of children, a weight on the edge, or any arbitrary number. Furthermore, these edge weights may be used to incorporate an expert's domain knowledge. For example, in the MeSH ontology, “Cyclin A” is a child of “cyclin”, which is a child of “growth substances”. As the former parent-child relationship is “stronger” than the latter, this can be captured by assigning weights to the edges, which can be used in defining the propagation factor f.

Methods

Turning now to FIG. 1, there is shown a flow chart of a method 100 of selecting keywords representative of a document using an ontology in accordance with a first arrangement. For ease of explanation, the method 100 is described with reference to a single ontology structure comprising a Directed Acyclic Graph (DAG); however, the method 100 is not intended to be limited to a single ontology structure or an ontology structure comprising a DAG. The method 100 can also be used on a plurality of ontologies and also on other ontology structures such as a collection of trees (CT) and a collection of DAGs (CD). Furthermore, the method 100 can also be used on a part of a document. Generally speaking, the method 100 computes the frequency of occurrence of all the terms of the ontology that occur in the document and assigns these frequency of occurrence values to corresponding vertices in the ontology structure. The method 100 then propagates these frequency of occurrence values from the leaves upwards to the root of the ontology structure, during which it weights them with a propagation factor. The method 100 then outputs the words of the ontology structure having the k largest weighted values as the keywords representative of the document. In this way, the present method 100 consistently selects k keywords from the ontology structure that are generally the most representative of the document. It will also be apparent that the keywords are selected from the ontology structure and not from the document itself, thus enabling the selection of representative keywords that do not necessarily appear in the document.

The method 100 commences at step 110 where the document and ontology are retrieved and any necessary parameters are initialised. The method 100 then proceeds to step 120, where the method 100 scans the document and computes the frequency of occurrence w_(t) of each term t of the ontology in the document.
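The scanning of step 120 can be sketched as follows. This is a minimal illustration only: the text does not specify how terms are matched against the document, so the case-insensitive whole-word matching used here is an assumption, as are the function name `term_frequencies` and the made-up sample sentence. The term names are those of the MeSH example mentioned earlier.

```python
import re


def term_frequencies(document, terms):
    """Count occurrences w_t of each ontology term t in the document text.

    Matching rule (an assumption): case-insensitive, whole-word or whole-phrase.
    """
    text = document.lower()
    counts = {}
    for t in terms:
        pattern = r"\b" + re.escape(t.lower()) + r"\b"
        counts[t] = len(re.findall(pattern, text))
    return counts


# Made-up sample text; not taken from any real document.
document = "Cyclin A binds cyclin. Growth substances include cyclin."
counts = term_frequencies(document, ["cyclin", "Cyclin A", "growth substances"])
print(counts)  # {'cyclin': 3, 'Cyclin A': 1, 'growth substances': 1}
```

Note that under this matching rule the shorter term "cyclin" is also counted inside "Cyclin A"; whether such overlaps should be counted is a design choice the text leaves open.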

After completion of step 120, the method 100 then proceeds to step 130, where the method 100 calls a sub-process 400 ‘propagate_wt(vertex v)’ and passes the root vertex of the DAG of the ontology structure as the vertex v to this sub-process 400.

The sub-process ‘propagate_wt(root)’ 400 recomputes and stores for each leaf and vertex v of the DAG an updated frequency of occurrence value w_(v). This updated frequency of occurrence value w_(v) in the case of a vertex v equals the sum of the old frequency of occurrence value w_(v) associated with that vertex v and the updated frequency of occurrence values of its immediate descendants times the propagation factor(s) f_(c) for those descendants. The frequency of occurrence value for a leaf v remains unchanged. This sub-process 400 will be described below in more detail with reference to FIG. 4.

After completion of the sub-process 400, the method 100 proceeds to step 140, where the method 100 calls a sub-process select_keywords(k) 140. This sub-process 140 takes as input an integer value k and then traverses the DAG ontology structure and selects and returns those words with the k largest updated values w_(t) as the keywords representative of the document. Specifically, the sub-process 140 scans the entire DAG ontology structure and generates a list of k terms having the largest updated values in the DAG ontology structure, and then returns that list. After completion of the sub-process 140, the method 100 then terminates 150. In this arrangement, the method utilises purely fractional weight propagation, i.e., the notion that a fraction of the weight may be transferred from a vertex to its parent, progressively, with the intention that a vertex which has many weighted descendants gets chosen as a keyword. To ensure that the effect of a vertex does not show up “unabatedly” in a high ancestor, at each level, the weight is multiplied by a fraction.
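The selection of step 140 amounts to a top-k query over the updated weights. The sketch below is one possible way to express it, assuming the updated values are held in a dictionary keyed by term; the signature `select_keywords(weights, k)`, the sample weights and the handling of ties (left to `heapq.nlargest`) are assumptions of this example and are not specified in the text.

```python
import heapq


def select_keywords(weights, k):
    """Return the k terms with the largest updated weights, largest first."""
    return heapq.nlargest(k, weights, key=weights.get)


# Hypothetical updated weights, as they might look after weight propagation.
weights = {"root": 2.0, "a": 3.0, "a1": 4.0, "a2": 2.0, "b": 1.0}
keywords = select_keywords(weights, 2)
print(keywords)  # ['a1', 'a']
```

In this example the internal vertex "a" is selected even if it never occurs in the document, because it accumulates a fraction of the weights of its descendants.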

Turning now to FIG. 2, there is shown a flow chart of a method 200 of selecting keywords in a document using an ontology in accordance with a second arrangement. For ease of explanation, the method 200 is described with reference to a single ontology structure comprising a Directed Acyclic Graph (DAG); however, the method 200 is not intended to be limited to a single ontology structure or an ontology structure comprising a DAG. The method 200 can also be used on a plurality of ontologies and also on other ontology structures such as a collection of trees (CT) and a collection of DAGs (CD). Furthermore, the method 200 can also be used on a part of a document. Generally speaking, the method 200 computes the frequency of occurrence of all the terms of the ontology that occur in the document and assigns these frequency of occurrence values to corresponding vertices in the ontology structure. The second arrangement then propagates these frequency of occurrence values from the leaves upwards to the root of the ontology structure, during which it weights them with a propagation factor. The second arrangement then selects a sub-structure of the ontology structure, which sub-structure comprises a set of unique paths from the root to the terms t having non-zero weights. This selection step disambiguates the context of these terms t. Finally, the second arrangement performs a greedy facility location sub-process, wherein all vertices having non-zero weights are considered as clients that have to be served by opening k facilities at k vertices such that a sum of weighted distances of all the clients to their associated facilities is minimized.

In this way, the present method 200 consistently selects k facilities, that is k keywords, from the ontology structure that are generally the most representative of the document. It will also be apparent that the keywords are selected from the ontology structure and not from the document itself, thus enabling the selection of representative keywords that do not necessarily appear in the document.

The method 200 commences at step 210 where the document and ontology are retrieved and any necessary parameters are initialized. The method 200 then proceeds to step 220, where the method 200 scans the document and computes the frequency of occurrence w_(t) of each term t of the ontology in the document. The method 200 then proceeds to step 230 where a variable T for storing the indices of the vertices of a sub-tree of the DAG ontology structure is initialized and set to Null. Also, during step 230 a variable C, for storing a sub-list of the vertices of the DAG, is initialized and set to Null.

After these two variables T and C have been set to Null, the method 200 then proceeds to step 240, where the method 200 calls the sub-process 400 ‘propagate_wt(vertex v)’, and passes the root vertex of the DAG of the ontology structure as the vertex v to this sub-process 400.

As mentioned above, the sub-process ‘propagate_wt(root)’ 400 recomputes and stores for each leaf and vertex v of the DAG an updated frequency of occurrence value w_(v). This updated frequency of occurrence value w_(v) in the case of a vertex v equals the sum of the old frequency of occurrence value w_(v) associated with that vertex v and the updated frequency of occurrence values of its immediate descendants times the propagation factor(s) f_(c) for those descendants. The frequency of occurrence value for a leaf v remains unchanged. This sub-process 400 will be described below in more detail with reference to FIG. 4.

After completion of step 240, the method 200 then proceeds to step 250. This step 250 is a loop and performs a first sub-step C=C+t, and then performs a second sub-step T=T+select_context(root,t) for each ontology term t that occurs in the document. It should be noted that these sub-steps are not performed on ontology terms t that do not occur in the document. Specifically, the loop traverses the DAG structure and performs these sub-steps only on those terms t associated with vertices t that have non-zero weights f.w_(v).

During a pass of the loop for a current vertex t that has a non-zero weight f.w_(v), the first sub-step C=C+t appends the current vertex t to the list C. Thus, after completion of the loop, the variable C contains a list of all those vertices of the DAG that have non-zero weights f.w_(v). Also, the operation T=T+select_context(root,t) appends to a sub-tree T the unique path from the root to the term t associated with the current vertex t. Thus, after the completion of the loop, the variable T contains a sub-tree T of the DAG ontology, which sub-tree T comprises a list of the unique paths from the root to the terms t that have non-zero weights. In this fashion, the operation T=T+select_context(root,t) is used to disambiguate the context of the terms t so that unique paths from the root to the respective terms are selected from the set of all paths P_(t). The operation T=T+select_context(root,t) achieves this by calling a sub-process ‘select_context(root,t)’ 500 for each current vertex t that has a non-zero weight, which sub-process 500 returns a list of vertices defining the unique path from the root to that term. This sub-process ‘select_context(root,t)’ 500 is described in more detail with reference to FIG. 5. In principle, other disambiguation sub-processes may be used as alternatives.

After completion of step 250, the method 200 then proceeds to step 260 where a sub-process ‘locate_fac(T, C, k)’ 600 is performed. This sub-process ‘locate_fac(T, C, k)’ 600 is a fractional greedy optimal facility location sub-process and takes as input the variable T, the variable C, and an integer variable k that indicates the number of keywords to be selected. This sub-process then returns k keywords that are representative of the document. This sub-process 600 will be described below in more detail with reference to FIG. 6. After completion of step 260, the method 200 then terminates 270.

Turning now to FIG. 3, there is shown a flow chart of a method 300 of selecting keywords representative of a document using an ontology in accordance with a third arrangement. For ease of explanation, the method 300 is again described with reference to a single ontology structure comprising a Directed Acyclic Graph (DAG); however, the method 300 is not intended to be limited to a single ontology structure or an ontology structure comprising a DAG. The method 300 can also be used on a plurality of ontologies and also on other ontology structures such as a collection of trees (CT) and a collection of DAGs (CD). Furthermore, the method 300 can also be used on a part of a document.

Generally speaking, the method 300 computes the frequency of occurrence of all the terms of the ontology that occur in the document and assigns these frequency of occurrence values to corresponding vertices in the ontology structure. The third arrangement then performs a greedy facility location sub-process, wherein all vertices having non-zero frequency of occurrence values are considered as clients that have to be served by opening k facilities such that a sum of weighted distances of all the clients to their associated facilities is minimized. In this way, the present method 300 consistently selects k keywords from the ontology structure that are generally the most representative of the document. It will also be apparent that the keywords are selected from the ontology structure and not from the document itself, thus enabling the selection of representative keywords that do not necessarily appear in the document.

The method 300 commences at step 310 where the document and ontology are retrieved and any necessary parameters are initialized. The method 300 then proceeds to step 320, where the method 300 scans the document and computes and stores the frequency of occurrence w_(t) of each term t of the ontology in the document. After completion of step 320, the method 300 then proceeds to step 330 where the sub-process ‘locate_fac(O, C, k)’ 600 is performed. This sub-process ‘locate_fac(O, C, k)’ 600 is the same fractional greedy optimal facility location sub-process that is used in the second arrangement, but in this third arrangement it takes as input the ontology structure O, a variable C and an integer variable k. The variable C is a list of all vertices v that have non-zero weights and the variable k is an integer which indicates the number of keywords to be selected. This sub-process 600 then returns k keywords that are representative of the document. The sub-process ‘locate_fac(O, C, k)’ 600 is described below in more detail with reference to FIG. 6. After completion of step 330, the method 300 then terminates 340.

Turning now to FIG. 4, there is shown a flow chart of the sub-process ‘propagate_wt(vertex v)’ as used in steps 130 and 240 of the methods of FIGS. 1 and 2 respectively. The sub-process 400 ‘propagate_wt(vertex v)’ is a recursive sub-process and commences at steps 130 and 240 where the root vertex is initially passed to the sub-process 400 as the current vertex v. The sub-process 400 then proceeds to a decision block 420, where a check is made whether the current vertex v is a leaf. If the decision block 420 determines that the current vertex v is a leaf, then the sub-process 400 proceeds to step 450 where the sub-process 400 returns the value f.w_(v), which value is equal to the propagation factor f for the current leaf times the frequency of occurrence value w_(v) for the current leaf v. As mentioned above, the propagation factor f is a value independent of the weight w_(v), and can be a predetermined constant, or may be a variable whose value is decided based upon the consideration of many factors. If, on the other hand, the decision block 420 determines that the current vertex v is not a leaf, then the sub-process 400 proceeds to step 430.

The sub-process 400 during step 430 computes the updated frequency of occurrence value w_(v) for the current vertex v. As mentioned above, this updated frequency of occurrence value w_(v) in the case of a vertex v equals the sum of the old frequency of occurrence value w_(v) associated with that vertex v and the updated frequency of occurrence values of its immediate descendants times the propagation factor(s) f_(c) associated with those descendants. Namely, the updated frequency of occurrence value w_(v) for a vertex v equals $w_{v} = w_{v} + \sum_{c} f_{c} \cdot w_{c},$ where w_(c) are the previously updated frequency of occurrence values for the child vertices of the vertex v. The step 430 achieves this by determining, for each child vertex c of the current vertex v, the sum w_(v)=w_(v)+propagate_wt(c), where the sum recursively calls the sub-process propagate_wt(c) for each child vertex c of the current vertex v. After the completion of step 430, the sub-process 400 proceeds to step 440, where the sub-process 400 returns the current value f.w_(v). After the completion of either step 450 or step 440, the sub-process 400 then terminates 460, and the respective methods of FIGS. 1 and 2 then proceed to steps 140 and 250 respectively. In this fashion, the sub-process 400 computes the updated frequency of occurrence values w_(v), whereby these values w_(v) increase in value along all paths from the leaves to the root of the ontology. In this way, a fraction of the frequency of occurrence values is propagated up the tree from the leaves to the root.
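The recursion of FIG. 4 can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: the ontology is a tree held in a hypothetical `children` dictionary, the weights are held in a dictionary `w` that is updated in place, and f is a constant 0.5. On a DAG, a vertex with several parents would be visited once per parent by this naive recursion, so some form of caching would be needed there.

```python
def propagate_wt(v, children, w, f=0.5):
    """Update w[v] with the weighted contributions of its children; return f * w[v]."""
    kids = children.get(v, [])
    if not kids:
        # Step 450: a leaf keeps its weight and passes the fraction f upward.
        return f * w.get(v, 0.0)
    for c in kids:
        # Step 430: w_v = w_v + propagate_wt(c), for each child c.
        w[v] = w.get(v, 0.0) + propagate_wt(c, children, w, f)
    # Step 440: return the fraction of the updated weight.
    return f * w[v]


# Hypothetical tree and document frequencies.
children = {"root": ["a", "b"], "a": ["a1", "a2"]}
w = {"root": 0.0, "a": 0.0, "a1": 4.0, "a2": 2.0, "b": 1.0}
propagate_wt("root", children, w)
print(w)  # {'root': 2.0, 'a': 3.0, 'a1': 4.0, 'a2': 2.0, 'b': 1.0}
```

Vertex "a" receives 0.5 * 4 + 0.5 * 2 = 3, and the root receives 0.5 * 3 + 0.5 * 1 = 2, so the contribution of a leaf is halved at every level it climbs.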

Turning now to FIG. 5, there is shown a flow chart of the sub-process select_context(vertex v, vertex t) of step 250 of the method of FIG. 2. As mentioned previously, the sub-process 500 select_context(vertex v, vertex t) is called for each term t in the ontology that occurs in the document, that is, it is called for each term that has a non-zero weighted vertex t. The sub-process 500 select_context(vertex v, vertex t) is a recursive sub-process and commences at step 510 where the root vertex is initially passed to the sub-process 500 as the current vertex v and the current vertex t is passed to the sub-process 500 as vertex t. The sub-process 500 then proceeds to a decision block 520, where a check is made whether the current vertex v is the same as the current vertex t. If the decision block 520 determines that the current vertices v and t are identical, then the sub-process 500 proceeds to step 550, where the sub-process 500 returns a Null value and the sub-process 500 terminates 560. On the other hand, if the decision block 520 determines that the current vertices v and t are not identical, then the sub-process 500 proceeds to step 530.

The sub-process 500 during step 530 selects the immediately descendant (i.e. child) vertex c of the current vertex v that is an ancestor of the current vertex t and that has the largest weight f.w_(v). After the completion of step 530, the sub-process 500 proceeds to step 540, where the sub-process 500 performs a return operation return(v, select_context(c, t)). The second parameter of this return operation recursively calls the sub-process 500 ‘select_context(c, t)’ with the current vertex v set to the selected child vertex c. After the completion of the step 540, the sub-process 500 then terminates 560, and returns to the method 200 that called the sub-process 500. In this fashion, the sub-process 500 selects the most appropriate context for each of the ontology terms t occurring in the document. Specifically, the sub-process 500 for a term t returns a unique path in the form of a series of vertices commencing at the root vertex and finishing at the vertex t followed by the Null value. The sub-process 500 selects the unique path to the term t in the ontology in such a manner that, where there are several paths branching from a single ancestor vertex of the unique path to a single descendant vertex, the sub-process 500 selects that immediately descendant vertex of the single ancestor vertex that has the largest weight as the next member of the unique path. In this way, the combination of the sub-processes 400 and 500 consistently selects a unique path for each term, and thus is able to disambiguate terms in the document.
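A minimal Python sketch of this context selection is given below. Several details are assumptions of the example: the DAG is held in a hypothetical `children` dictionary, the propagated weights in a dictionary `w`, a helper `reaches` (not named in the text) tests whether a child leads to t, and the base case returns the list [t] in place of a bare Null so that the returned path ends at t. The sketch also assumes that t is reachable from the starting vertex.

```python
def reaches(v, t, children):
    """Hypothetical helper: True if t equals v or is a descendant of v."""
    if v == t:
        return True
    return any(reaches(c, t, children) for c in children.get(v, []))


def select_context(v, t, children, w):
    """Return one path from v to t, following the heaviest child at each branch."""
    if v == t:
        return [t]  # assumption: stands in for the Null return of step 550
    # Step 530: among the children leading to t, pick the one with largest weight.
    candidates = [c for c in children.get(v, []) if reaches(c, t, children)]
    best = max(candidates, key=lambda c: w.get(c, 0.0))
    # Step 540: return (v, select_context(c, t)).
    return [v] + select_context(best, t, children, w)


# Hypothetical DAG in which term "t" has two parents with different weights.
children = {"root": ["x", "y"], "x": ["t"], "y": ["t"]}
w = {"root": 3.0, "x": 1.0, "y": 3.0, "t": 2.0}
path = select_context("root", "t", children, w)
print(path)  # ['root', 'y', 't']
```

The heavier parent "y" is preferred, reflecting the idea that the branch of the ontology with more supporting evidence in the document is the more plausible context for an ambiguous term.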

Turning now to FIG. 6, there is shown a flow chart of the sub-process locate_fac(T, C, integer k) 600 used in step 260 of the method of FIG. 2, and also in step 330 of FIG. 3. Specifically, this fractional greedy facility location sub-process 600 selects k facilities that minimize a cost, which cost equals the total of the servicing costs for all the clients. The sub-process 600 in computing this cost opens k facilities at k vertices of the tree T, which k facilities serve clients C, the latter being the non-zero vertices of the tree T. The servicing cost of a client is computed as the distance of that client to its associated facility multiplied by a weight associated with the client. This associated weight equals the number of times that the word associated with the client (viz. vertex) appears in the document, and the distance between a client and a facility is the number of edges between that client and that facility. It is important to recognise that this weight is the initial weight (which is based on the number of occurrences in the document) and not the updated weight generated by the propagate_wt process 400. Also, this servicing cost is subject to the constraints that a facility can only serve descendant clients and a client can be served by multiple facilities. Accordingly, in the case of a client being served by multiple facilities, the servicing cost of this client is the total of the servicing costs for this client to the respective multiple facilities. The cost of an unserved client is set infinitely high, i.e. very high compared to the other costs, so that no solution with unsatisfied clients can be the optimal solution. In this case, the number k of facilities to be opened is adjusted so as to obtain an optimal, viz. minimal, solution.

The greedy facility location sub-process locate_fac(T, C, integer k) 600 generates an optimal solution of the following: $\min \sum_{v \in V} W_{v} \cdot d(v, F_{v}), \qquad d(v, F_{v}) = \sum_{F_{i} \text{ serves } v} d(v, F_{i})$ (Eqn (1)), where d(v, F_(v)) denotes the distance between a vertex v and its associated set of facilities F_(v), summed over the distance between the vertex v and each one of its facilities F_(i), where the distance d(v, F_(i)) is the number of edges between the vertex v and the facility F_(i), and where W_(v) is the number of times that the word associated with the vertex v appears in the document. A vertex v may be served entirely by a single facility F_(i), or may be partially served by all the facilities F_(i), 1<=i<=k.
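To make the objective of Eqn (1) concrete, the following Python sketch evaluates the cost of one given assignment of clients to facilities; it does not search for the minimum. The names `service_cost`, `distance_to_ancestor`, `parent` and `served_by`, and the sample tree, are hypothetical. An unserved client is given an infinite cost, following the description above.

```python
def distance_to_ancestor(v, facility, parent):
    """Number of edges from client v up to an ancestor facility."""
    d = 0
    while v != facility:
        if v not in parent:
            raise ValueError("facility is not an ancestor of the client")
        v = parent[v]
        d += 1
    return d


def service_cost(weights, served_by, parent):
    """Sum over clients of W_v times the summed distances to their facilities."""
    total = 0
    for v, w_v in weights.items():
        if w_v == 0:
            continue  # only vertices with non-zero weight are clients
        facilities = served_by.get(v, [])
        if not facilities:
            return float("inf")  # an unserved client makes the solution infeasible
        total += w_v * sum(distance_to_ancestor(v, F, parent) for F in facilities)
    return total


# Hypothetical tree (child -> parent) with initial document frequencies.
parent = {"a": "root", "b": "root", "a1": "a", "a2": "a"}
weights = {"a1": 4, "a2": 2, "b": 1}
served_by = {"a1": ["a"], "a2": ["a"], "b": ["root"]}
cost = service_cost(weights, served_by, parent)
print(cost)  # 7
```

With facilities at "a" and "root", each client is one edge away from its facility, so the cost is 4 * 1 + 2 * 1 + 1 * 1 = 7.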

The greedy facility location sub-process locate_fac(T, C, integer k) 600 commences at step 610, where the variables T, C and k are passed to the sub-process 600 and other necessary parameters are initialised. As mentioned previously, the method in accordance with the third arrangement passes the entire DAG ontology tree structure O to the sub-process 600 by means of this variable T, viz. locate_fac(O, C, integer k). On the other hand, the method in accordance with the second arrangement passes a sub-tree T of the DAG ontology structure O to the sub-process 600 via this variable T, viz. locate_fac(T, C, integer k). In the latter arrangement, this sub-tree T comprises a list of the unique paths from the root to the terms t that have non-zero weights. For ease of explanation of the sub-process 600, the ontology tree structure O and the sub-tree structure T passed to the sub-process 600 will both be referred to as tree T. The variable C comprises a list of all clients, namely all vertices v of the tree T that have non-zero weights, and the integer k represents the number of keywords to be selected.

After step 610, the sub-process 600 then computes 620 the facility capacity C (not to be confused with the client list C). This facility capacity C equals the sum of all the weights w_(v) of the tree T divided by the maximum number of facilities k. As mentioned previously, these weights w_(v) are associated with respective vertices of the tree, and each weight equals the number of times that a word associated with the vertex appears in the document. This weight is the initial weight (which is based on the number of occurrences in the document) and not the updated weight generated by the propagate_wt process 400. After computation of the facility capacity C, the sub-process 600 then deletes 630 all leaves of the tree T that have weights w_(v) equal to zero.

After step 630, the sub-process 600 enters a loop 640-680, where the sub-process 600 first selects, for processing, any leaf v of the tree T not already processed by the loop. The sub-process 600 then proceeds to a decision block 650, where the sub-process 600 checks whether the weight w_(v) associated with the selected leaf v is greater than or equal to the facility capacity C.

If the decision block 650 determines that w_(v)>=C for the selected leaf v, then the sub-process 600 opens 660 a facility at the selected leaf v. The sub-process 600 then propagates 670 the weight [w_(v)−C] to the parent node of the selected leaf v. Specifically, the weight of the parent of the selected leaf v is updated according to w_(parent(v))=w_(parent(v))+[w_(v)−C]. After completion of the propagation step 670, the sub-process 600 proceeds to decision block 680.

If, on the other hand, the decision block 650 determines that w_(v)<C for the selected leaf v, then the sub-process 600 propagates 665 the weight w_(v) of the selected leaf to its parent node. Specifically, the weight of the parent of the selected leaf is updated according to w_(parent(v))=w_(parent(v))+w_(v). After this updating step 665, the sub-process 600 then deletes 675 the selected leaf v from the tree T. After completion of the deletion step 675, the sub-process 600 proceeds to decision block 680.

The decision block 680 checks whether or not k facilities have been opened. In the event the decision block 680 returns false, the sub-process 600 returns to step 640 for processing of a leaf not previously processed. It should be noted that in the case where w_(v)<C for a selected leaf, the sub-process 600 deletes the selected leaf from the tree T. The sub-process 600 in this case results in a new set of leaves (a shrunken tree T′) to be subsequently processed by the loop 640-680. In the case where w_(v)>=C, the sub-process 600 does not delete the selected leaf and, in the next pass of step 640, the sub-process 600 selects from the tree (T or T′ as the case may be) a leaf that has not been previously processed.

The sub-process 600 continues in this fashion until the decision block 680 finally determines that k facilities have been opened, and the sub-process 600 terminates.
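By way of illustration only, the loop 640-680 may be sketched in Python as follows. The sketch rests on several assumptions that are not stated in the description: the tree is given as a child-to-parent map, a vertex is treated as a leaf once all of its children have been processed or deleted, any surplus arriving at the root is discarded, and the loop also stops when no unprocessed vertex remains. The names parent, weight, pending and frontier are hypothetical.

```python
def locate_fac(parent, weight, k):
    """Greedy sketch of sub-process 600.

    parent: dict mapping each vertex to its parent (the root maps to None).
    weight: dict mapping each vertex to its initial occurrence count.
    Returns the vertices at which facilities were opened.
    """
    w = dict(weight)                    # working copy of the weights
    capacity = sum(w.values()) / k      # step 620: facility capacity

    # Assumption: count the children of each vertex that are still unprocessed.
    pending = {v: 0 for v in parent}
    for v, p in parent.items():
        if p is not None:
            pending[p] += 1
    frontier = [v for v in parent if pending[v] == 0]   # current leaves

    facilities = []
    while frontier and len(facilities) < k:             # decision block 680
        v = frontier.pop()                               # step 640: any leaf
        if w[v] > 0 and w[v] >= capacity:                # decision block 650
            facilities.append(v)                         # step 660: open facility
            passed_up = w[v] - capacity                  # step 670: surplus only
        else:
            # Steps 665 and 675: pass the whole weight up and drop the leaf.
            # A zero-weight leaf passes up nothing, which has the same effect
            # as deleting it beforehand.
            passed_up = w[v]
        p = parent[v]
        if p is not None:
            w[p] += passed_up
            pending[p] -= 1
            if pending[p] == 0:
                frontier.append(p)
    return facilities


# Hypothetical tree and occurrence counts: total weight 8, k = 2, capacity 4.
parent = {"root": None, "a": "root", "b": "root", "a1": "a", "a2": "a", "b1": "b"}
weight = {"root": 0, "a": 1, "b": 0, "a1": 4, "a2": 1, "b1": 2}
opened = locate_fac(parent, weight, 2)
```

In this example the leaf a1 reaches the capacity on its own, while the remaining weight accumulates upwards until a second facility is opened at the root.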

In this way, the modeling of the keyword selection as a capacitated facility location problem results in a reliable and robust selection of keywords, and the greedy facility location sub-process 600 is an efficient process for solving that problem. In addition, the greedy facility location sub-process 600 guarantees optimality where a tree structure T is extracted from an ontology O using disambiguation, as in the second arrangement. However, in the third arrangement, where the ontology O is left as is, the sub-process 600 does not guarantee optimality. Nevertheless, the third arrangement, whilst not giving optimal results, is expected to produce useful results.

Other facility location sub-processes for solving the aforementioned facility location problem (Eqn (1)) may be used in the second and third arrangements instead of the fractional greedy optimal location sub-process described herein with reference to FIG. 6. In particular, an optimal dynamic programming based sub-process or an optimal fractional greedy sub-process can be used for ontology structures comprising trees (CT). In further variations, a greedy static sub-process or a greedy adaptive sub-process can be used for ontology structures comprising a DAG. Furthermore, capacitated and uncapacitated versions can be used.

As can be seen, the methods in accordance with the first, second and third arrangements are not limited to any specific ontology, and different ontologies may be plugged in depending on the nature and level of the keyword representation that is required. In this sense these methods are independent of the domain ontology (taxonomy).

In a variation of the first and second arrangements the propagation factor can be tunable. For example, the propagation factor f can be made a function of the edge weight or of the level, depending on the actual ontology used.

The methods in accordance with the first and third arrangements can work with any of the ontology structures DAG, CD and CT. The method in accordance with the second arrangement, in addition to working with DAG ontology structures, can also work with CT ontologies subject to some modifications to selecting the context, that is, to the context selection sub-process 300. In the case of CT structures, a number of alternative ways of selecting the context are possible. In all of these alternatives, the modified context selection sub-process first finds all the paths leading from the root to the term. In one alternative the modified context selection sub-process then selects the path that has the maximum average weight per vertex. In another alternative the modified context selection sub-process then selects the path that has the vertex with the largest weight. In still another alternative the modified context selection sub-process selects the path with the largest sum of weights. The method in accordance with the second arrangement can also be used with CD ontologies subject to some modifications to the context selection sub-process 300. The modified method for CD ontologies can be implemented by performing the context selection sub-process 300 independently on each of the DAGs, which results in a collection of trees, and then implementing one of the aforementioned modified context selection sub-processes on this collection of trees.
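The three alternatives for choosing among the root-to-term paths of a CT structure may be compared with a short Python sketch. It assumes that the candidate paths have already been enumerated as lists of vertices and that a weight is available for each vertex; the function name select_path and the rule names are hypothetical labels introduced for this illustration.

```python
def select_path(paths, weight, rule="max_average"):
    """Pick one root-to-term path according to one of the three alternatives."""
    scorers = {
        # Alternative 1: maximum average weight per vertex.
        "max_average": lambda p: sum(weight[v] for v in p) / len(p),
        # Alternative 2: the path containing the vertex with the largest weight.
        "max_vertex": lambda p: max(weight[v] for v in p),
        # Alternative 3: the largest sum of weights.
        "max_sum": lambda p: sum(weight[v] for v in p),
    }
    return max(paths, key=scorers[rule])


# Hypothetical example: two paths from the root to a term t.
weight = {"root": 1, "a": 6, "b": 2, "c": 5, "t": 3}
short_path = ["root", "a", "t"]       # sum 10, average 10/3, largest vertex 6
long_path = ["root", "b", "c", "t"]   # sum 11, average 2.75, largest vertex 5

by_average = select_path([short_path, long_path], weight, "max_average")
by_vertex = select_path([short_path, long_path], weight, "max_vertex")
by_sum = select_path([short_path, long_path], weight, "max_sum")
```

Here the first two rules select the shorter path and the third selects the longer one, which shows that the alternatives can disagree on the same ontology.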

Computer Software

The steps of the methods 100, 200, and 300 are preferably implemented as software code means for execution on a computer system such as that described with reference to FIG. 7. Exemplary pseudo software code for implementing the steps of the method 100 is illustrated as follows:

    scan the document and compute wt for each ontology-term t;
    propagate_wt(root);
    select_keywords(k);

    Sub-Routines:

    propagate_wt(ν)
        if (ν is a leaf) return f.wν
        else for each child c of ν, wν = wν + propagate_wt(c);
        return f.wν

    select_keywords(k)
        return the top k words with maximum weight f.wν
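The pseudo code for the method 100 may be rendered in Python along the following lines. This is a minimal sketch that assumes a tree-shaped ontology (in a DAG, a vertex with several parents would be visited more than once by this recursion) and a single constant propagation factor f; the Vertex class and the helper iter_vertices are hypothetical and are introduced only for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    term: str
    weight: float = 0.0                  # initially the occurrence count of the term
    children: list = field(default_factory=list)


def propagate_wt(v, f):
    """Add f times each child's propagated weight to v, then return f * w_v."""
    for c in v.children:
        v.weight += propagate_wt(c, f)
    return f * v.weight


def iter_vertices(v):
    """Yield v and all of its descendants in pre-order."""
    yield v
    for c in v.children:
        yield from iter_vertices(c)


def select_keywords(root, k):
    """Return the k terms with the largest propagated weight.

    Ranking by w_v gives the same order as ranking by f * w_v when f is a
    positive constant.
    """
    ranked = sorted(iter_vertices(root), key=lambda v: v.weight, reverse=True)
    return [v.term for v in ranked[:k]]


# Hypothetical four-term ontology with occurrence counts taken from a document.
optics = Vertex("optics", 3)
physics = Vertex("physics", 2, [optics])
biology = Vertex("biology", 1)
science = Vertex("science", 0, [physics, biology])

propagate_wt(science, 0.5)
keywords = select_keywords(science, 2)
```

With f = 0.5 the weight of "physics" rises from 2 to 3.5 because it absorbs half the weight of "optics", so an internal term can outrank the leaf beneath it.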

Exemplary pseudo software code for implementing the steps of the method 200 is illustrated as follows:

    scan the document and compute wt for each ontology-term t;
    T = Null; C = Null;
    propagate_wt(root);
    for each ontology-term t in the document
        C += t;
        T += select_context(root,t);
        // used to disambiguate the context of t so that a unique path is
        // selected from root to t. In principle, other disambiguation
        // sub-processes may be used as alternatives
    locate_fac(T,C,k);
        // runs a fractional greedy optimal facility location sub-process on
        // a tree T for clients in C to place k facilities.

    Sub-Routines:

    propagate_wt(ν)
        if (ν is a leaf) return f.wν
        else for each child c of ν, wν = wν + propagate_wt(c);
        return f.wν

    select_context(ν,t)
        if (ν == t), return null;
        else select the largest weight child c of ν that is an ancestor of t.
        // Note that in the case of a DAG, t is a unique vertex,
        // whereas in the case of CT/CD, t may appear as a
        // collection of vertices.
        return (ν, select_context(c,t));
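The select_context sub-routine may be sketched in Python as shown below. The sketch assumes a DAG held as a parent-to-children map, assumes that the weights are those produced by propagate_wt, and treats the term t as an "ancestor" of itself so that the descent can reach it. The helper reaches and the map names children and weight are hypothetical.

```python
def reaches(children, v, t):
    """Return True if t is v itself or a descendant of v."""
    return v == t or any(reaches(children, c, t) for c in children.get(v, ()))


def select_context(children, weight, v, t):
    """Descend from v towards t, always taking the heaviest child that leads to t.

    As in the pseudo code, the returned path runs from v to the parent of t
    and does not include t itself.
    """
    path = []
    while v != t:
        candidates = [c for c in children.get(v, ()) if reaches(children, c, t)]
        if not candidates:
            raise ValueError("t is not reachable from v")
        path.append(v)
        v = max(candidates, key=lambda c: weight[c])
    return path


# Hypothetical DAG in which the term t has two parents, a and b.
children = {"root": ["a", "b"], "a": ["t"], "b": ["t"]}
weight = {"root": 10, "a": 3, "b": 5, "t": 2}

context = select_context(children, weight, "root", "t")
```

Because b is heavier than a, the path through b is kept as the unique context of t and the path through a is discarded.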

Exemplary pseudo software code for implementing the steps of the method 300 is illustrated as follows:

    scan the document and compute wt for each ontology-term t;
    locate_fac(T,C,k);
        // runs a fractional greedy optimal facility location sub-process on
        // a tree T for clients in C to place k facilities.

The aforementioned pseudo code is not intended to be limited to any particular programming language and implementation thereof. It will be appreciated that a variety of programming languages and implementations thereof may be used to implement the teachings of the invention as described herein.

Computer Hardware

FIG. 7 is a schematic representation of a computer system 1000 of a type that is suitable for executing computer software for selecting keywords representative of a document using an ontology. Computer software executes under a suitable operating system installed on the computer system 1000, and may be thought of as comprising various software code means for achieving particular steps of the methods 100, 200 or 300.

The components of the computer system 1000 include a computer 1020, a keyboard 1010 and mouse 1015, and a video display 1090. The computer 1020 includes a processor 1040, a memory 1050, input/output (I/O) interfaces 1060, 1065, a video interface 1045, and a storage device 1055.

The processor 1040 is a central processing unit (CPU) that executes the operating system and the computer software executing under the operating system. The memory 1050 includes random access memory (RAM) and read-only memory (ROM), and is used under direction of the processor 1040.

The video interface 1045 is connected to video display 1090 and provides video signals for display on the video display 1090. User input to operate the computer 1020 is provided from the keyboard 1010 and mouse 1015. The storage device 1055 can include a disk drive or any other suitable storage medium.

Each of the components of the computer 1020 is connected to an internal bus 1030 that includes data, address, and control buses, to allow components of the computer 1020 to communicate with each other via the bus 1030.

The computer system 1000 can be connected to one or more other similar computers via an input/output (I/O) interface 1065 using a communication channel 1085 to a network, represented as the Internet 1080.

The computer software may be recorded on a portable storage medium, in which case, the computer software program is accessed by the computer system 1000 from the storage device 1055. Alternatively, the computer software can be accessed directly from the Internet 1080 by the computer 1020. In either case, a user can interact with the computer system 1000 using the keyboard 1010 and mouse 1015 to operate the programmed computer software executing on the computer 1020.

Other configurations or types of computer systems can be equally well used to execute computer software that assists in implementing the techniques described herein.

CONCLUSION

Various alterations and modifications can be made to the techniques and arrangements described herein, as would be apparent to one skilled in the relevant art.

1. A method of selecting keywords representative of a document from an ontology, said method comprising: computing, for each term in the ontology, a value representative of a frequency of occurrence of said term in the document; and selecting a subset of terms of the ontology as keywords representative of the document based on said value.
2. A method of selecting keywords representative of a document from an ontology, wherein the ontology comprises terms arranged in a tree-like structure, said method comprising: computing, for each term in the ontology, a first value representative of a frequency of occurrence of said term in the document; assigning said first value to corresponding vertices in the ontology; propagating said first value from leaf vertices of the ontology upwards to the one or more root vertices of the ontology by assigning to each vertex a second value, wherein said second value equals a sum of said first value of the vertex plus the second values of immediate descendent vertices of said vertex each multiplied by a corresponding propagation factor; and selecting k terms of the ontology as keywords representative of the document that have a largest k second value.
3. A method of selecting keywords representative of a document from an ontology, wherein the ontology comprises terms arranged in a tree-like structure having one or more root vertices, vertices and leaf vertices, said method comprising: computing, for each term in the ontology, a first value representative of a frequency of occurrence of said term in the document; assigning first values to corresponding vertices in the ontology; propagating said first values from the leaf vertices of the ontology upwards to the one or more root vertices of the ontology by assigning to each vertex a second value, wherein said second value equals a sum of said first value of the vertex plus the second values of immediate descendent vertices of said vertex each multiplied by a corresponding propagation factor; generating a sub-structure of the ontology, wherein the sub-structure comprises a unique path for each term so as to disambiguate a context of the terms; and performing an optimization process, wherein k vertices are selected such that a sum of weighted distances of all the vertices having non-zero second values to associated selected k vertices is minimized, and wherein k terms associated with the selected k vertices are selected as keywords representative of the document.
4. The method of claim 3, wherein the optimization process comprises a greedy facility location process.
5. The method of claim 3, wherein the optimization process comprises a greedy facility location process, wherein the vertices having non-zero second values are clients, the selected k vertices are facilities serving the clients, the weighted distance between a client and a facility is a number of edges of the tree-like structure between the client and the facility multiplied by a sum of the second values of the vertices in a subtree of the facility, wherein facilities can serve only descendent clients and clients can be served by multiple facilities.
6. The method of claim 3, wherein the optimization process comprises an optimal dynamic programming based process.
7. A method of selecting keywords representative of a document from an ontology, wherein the ontology comprises terms arranged in a tree-like structure having one or more root vertices, vertices and leaf vertices, said method comprising: computing, for each term in the ontology, a first value representative of a frequency of occurrence of said term in the document; assigning frequency of occurrence values to corresponding vertices in the ontology; and performing an optimization process, wherein k vertices are selected such that a sum of weighted distances of all the vertices having non-zero first values to associated selected k vertices is minimized, and wherein k terms associated with the selected k vertices are selected as keywords representative of the document.
8. The method of claim 7, wherein the optimization process comprises a greedy facility location process.
9. The method of claim 7, wherein the optimization process comprises a greedy facility location process, wherein the vertices having non-zero second values are clients, the selected k vertices are facilities serving the clients, the weighted distance between a client and a facility is a number of edges of the tree-like structure between the client and the facility multiplied by a sum of the second values of the vertices in a subtree of the facility, wherein facilities can serve only descendent clients and clients can be served by multiple facilities.
10. The method of claim 7, wherein the optimization process comprises an optimal dynamic programming based process.
11. A computer program product for selecting keywords representative of a document from an ontology, the computer program product comprising computer software recorded on a computer-readable medium for performing a method comprising: computing, for each term in the ontology, a value representative of a frequency of occurrence of said term in the document; and selecting a subset of terms of the ontology as keywords representative of the document based on said value.
12. A computer system for selecting keywords representative of a document from an ontology, the computer system comprising computer software recorded on a computer-readable medium for performing a method comprising: computing, for each term in the ontology, a value representative of a frequency of occurrence of said term in the document; and selecting a subset of terms of the ontology as keywords representative of the document based on said value.
13. A computer program product for selecting keywords representative of a document from an ontology, the computer program product comprising computer software recorded on a computer-readable medium for performing a method comprising: computing, for each term in the ontology, a first value representative of a frequency of occurrence of said term in the document; assigning said first value to corresponding vertices in the ontology; propagating said first value from leaf vertices of the ontology upwards to the one or more root vertices of the ontology by assigning to each vertex a second value, wherein said second value equals a sum of said first value of the vertex plus the second values of immediate descendent vertices of said vertex each multiplied by a corresponding propagation factor; and selecting k terms of the ontology as keywords representative of the document that have a largest k second value.
14. A computer system for selecting keywords representative of a document from an ontology, the computer system comprising computer software recorded on a computer-readable medium for performing a method comprising: computing, for each term in the ontology, a first value representative of a frequency of occurrence of said term in the document; assigning said first value to corresponding vertices in the ontology; propagating said first value from leaf vertices of the ontology upwards to the one or more root vertices of the ontology by assigning to each vertex a second value, wherein said second value equals a sum of said first value of the vertex plus the second values of immediate descendent vertices of said vertex each multiplied by a corresponding propagation factor; and selecting k terms of the ontology as keywords representative of the document that have a largest k second value.
15. A computer program product for selecting keywords representative of a document from an ontology, the computer program product comprising computer software recorded on a computer-readable medium for performing a method comprising: computing, for each term in the ontology, a first value representative of a frequency of occurrence of said term in the document; assigning first values to corresponding vertices in the ontology; propagating said first values from the leaf vertices of the ontology upwards to the one or more root vertices of the ontology by assigning to each vertex a second value, wherein said second value equals a sum of said first value of the vertex plus the second values of immediate descendent vertices of said vertex each multiplied by a corresponding propagation factor; generating a sub-structure of the ontology, wherein the sub-structure comprises a unique path for each term so as to disambiguate a context of the terms; and performing an optimization process, wherein k vertices are selected such that a sum of weighted distances of all the vertices having non-zero second values to associated selected k vertices is minimized, and wherein k terms associated with the selected k vertices are selected as keywords representative of the document.
16. A computer system for selecting keywords representative of a document from an ontology, the computer system comprising computer software recorded on a computer-readable medium for performing a method comprising: computing, for each term in the ontology, a first value representative of a frequency of occurrence of said term in the document; assigning first values to corresponding vertices in the ontology; propagating said first values from the leaf vertices of the ontology upwards to the one or more root vertices of the ontology by assigning to each vertex a second value, wherein said second value equals a sum of said first value of the vertex plus the second values of immediate descendent vertices of said vertex each multiplied by a corresponding propagation factor; generating a sub-structure of the ontology, wherein the sub-structure comprises a unique path for each term so as to disambiguate a context of the terms; and performing an optimization process, wherein k vertices are selected such that a sum of weighted distances of all the vertices having non-zero second values to associated selected k vertices is minimized, and wherein k terms associated with the selected k vertices are selected as keywords representative of the document.
17. A computer program product for selecting keywords representative of a document from an ontology, the computer program product comprising computer software recorded on a computer-readable medium for performing a method comprising: computing, for each term in the ontology, a first value representative of a frequency of occurrence of said term in the document; assigning frequency of occurrence values to corresponding vertices in the ontology; and performing an optimization process, wherein k vertices are selected such that a sum of weighted distances of all the vertices having non-zero first values to associated selected k vertices is minimized, and wherein k terms associated with the selected k vertices are selected as keywords representative of the document.
18. A computer system for selecting keywords representative of a document from an ontology, the computer system comprising computer software recorded on a computer-readable medium for performing a method comprising: computing, for each term in the ontology, a first value representative of a frequency of occurrence of said term in the document; assigning frequency of occurrence values to corresponding vertices in the ontology; and performing an optimization process, wherein k vertices are selected such that a sum of weighted distances of all the vertices having non-zero first values to associated selected k vertices is minimized, and wherein k terms associated with the selected k vertices are selected as keywords representative of the document.